Optimize CmSketch AVX block access and in-register updates for frequency estimate and increment #783
Conversation
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/7289ae1d-0fa0-43fd-bbe1-2511d41eb3d9 Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
#endif
{
Vector128<ushort> count = Avx2.PermuteVar8x32(Avx2.And(Avx2.ShiftRightLogicalVariable(Avx2.GatherVector256(tablePtr, blockOffset, 8), indexLong), Vector256.Create(0xfL)).AsInt32(), Vector256.Create(0, 2, 4, 6, 1, 3, 5, 7))
Vector256<long> lower = Avx.LoadVector256(tablePtr + block);
@copilot can this use the load aligned vector method instead to make it faster?
Updated in cfbd9ac to use aligned AVX loads/stores on the pre-aligned table pointer, and corrected the pointer alignment math so those aligned accesses are safe.
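Aligned loads require the table pointer to sit on a 32-byte boundary, which is what the corrected pointer alignment math guarantees. As an illustrative sketch (plain Python modelling the pointer arithmetic, not the actual C# allocation code), rounding a raw address up to the next 32-byte boundary looks like:

```python
def align_up(addr: int, alignment: int = 32) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

# Over-allocating by (alignment - 1) extra bytes guarantees an aligned start fits.
base = 1000003                 # hypothetical raw allocation address
aligned = align_up(base)       # first 32-byte boundary at or after base
assert aligned % 32 == 0 and 0 <= aligned - base < 32
```

With the table start pinned to such a boundary, every block load at a multiple of 32 bytes from it is also aligned, so the aligned load/store intrinsics are safe to use.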
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/68029771-3d41-4b5f-8733-4bb8c9d8ab7c Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
Vector256<long> upper = Avx.LoadVector256(tablePtr + block + 4);
#endif
long lower0 = lower.GetElement(0);
@copilot can these operations be vectorized so that all computation is done in vector registers?
Yes — updated in 07f4d28 so the IncrementAvx lane selection, saturation check, and counter increment stay in vector registers, with only the final block stores leaving the registers.
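As a scalar model of what that in-register update computes per lane (hypothetical Python; one 4-bit counter packed into a 64-bit word, where the vectorized code performs the same saturation test with a compare mask and adds the masked increment across all lanes at once):

```python
COUNTER_MASK = 0xF  # each counter is a 4-bit nibble packed into a 64-bit word

def increment_counter(word: int, nibble_index: int) -> int:
    """Increment one 4-bit counter in word, saturating at 15."""
    shift = nibble_index * 4
    value = (word >> shift) & COUNTER_MASK
    if value < 15:              # the AVX path expresses this test as a compare mask
        word += 1 << shift      # and adds the masked increment in-register
    return word

w = increment_counter(0, 3)     # sets nibble 3, i.e. bits 12..15
```

The saturation check matters because a 4-bit counter that wrapped from 15 back to 0 would corrupt the frequency estimate; the vector version zeroes the increment for saturated lanes instead of branching per counter.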
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/c76dc857-c67e-4ce6-a05f-2ebb894fcc0b Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
int lane2 = h2 & 1;
int lane3 = (h3 & 1) + 2;
Vector256<ulong> lowerIndex = Vector256.Create(
@copilot can this be written so that the values are loaded into vector registers without any branches, and there is a mask for the lanes that sets values to zero after they are loaded. That way all of this can be expressed mainly with vector assign and masking without the branches.
Updated in aabff69 to remove the branchy lane setup: the block halves are still loaded directly into vector registers, and the active lanes are now selected with vector compare masks so inactive lanes are zeroed by masking in-register.
Agent-Logs-Url: https://github.com/bitfaster/BitFaster.Caching/sessions/cdf1657f-7869-4fc8-a0a6-43ccc27a3f87 Co-authored-by: bitfaster <12851828+bitfaster@users.noreply.github.com>
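The mask-based lane selection can be sketched in plain Python (illustrative names, not the C# implementation): comparing each lane id against the hash-selected lane yields an all-ones or all-zeros mask per lane, and ANDing values with that mask zeroes the inactive lanes without any branch.

```python
def select_lane(values, selected_lane):
    """Zero every lane except selected_lane, using a compare-style mask.
    -(i == selected_lane) is -1 (all ones) for the match and 0 otherwise,
    mirroring a SIMD compare-equal followed by a bitwise AND."""
    return [v & -(i == selected_lane) for i, v in enumerate(values)]

print(select_lane([9, 4, 7, 1], 2))  # → [0, 0, 7, 0]
```

This is the same shape as loading all block values unconditionally and then masking, so the hot path contains only vector assigns and masks rather than data-dependent branches.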
CmSketchCore's AVX2 paths were still doing per-counter gathers and scalar table updates. This change restructures both hot paths to operate on the 64-byte sketch block as two contiguous Vector256<long> loads, reducing memory traffic and keeping counter selection/update local to the loaded vectors.

What changed

- EstimateFrequencyAvx now reads the target block once as two consecutive Vector256<long> values and extracts the four candidate counters from those vectors instead of gathering four independent longs.
- IncrementAvx now follows the same block-oriented pattern: load both halves of the block, compute the selected counter updates in-register, then write the updated block back with two contiguous vector stores.
- The IncrementAvx update path no longer drops back to scalar lane extraction/reconstruction: lane selection, saturation checks, and increments are now performed with vector masks and variable shifts.

Hot-path shape

Both paths load the two halves of the target block at block and block + 4.

Resulting AVX flow
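As a scalar sketch of the block-oriented estimate shape described above (a hypothetical Python model, not the C# code: one sketch block is eight 64-bit words holding 4-bit counters, and the estimate is the minimum of the four candidate counters picked by the hashes):

```python
def estimate_frequency(block_words, nibble_indexes):
    """block_words: eight 64-bit words (one 64-byte sketch block);
    nibble_indexes: one (word, nibble) pair per hash function.
    Reads the block once and takes the min of the four 4-bit counters."""
    counters = [(block_words[w] >> (n * 4)) & 0xF for w, n in nibble_indexes]
    return min(counters)

block = [0] * 8
block[1] = 0x5 << 8    # counter value 5 at word 1, nibble 2
block[3] = 0x3         # counter value 3 at word 3, nibble 0
block[5] = 0x4 << 4    # counter value 4 at word 5, nibble 1
block[7] = 0x2 << 12   # counter value 2 at word 7, nibble 3
print(estimate_frequency(block, [(1, 2), (3, 0), (5, 1), (7, 3)]))  # → 2
```

In the vectorized path the same extraction happens with variable right shifts and a 0xF mask on the two loaded Vector256<long> halves, so no per-counter gather is needed.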